Efficient Algorithms for Grouping Data to Improve Data Quality
Abstract
Improving and maintaining data quality has become a critical issue for many companies and organizations, because poor data degrades organizational performance whereas quality data results in cost savings and customer satisfaction. Activities such as identifying and removing "duplicate" records from a single database, and correlating records from different databases that identify the same real-world "entity", are used routinely to improve data quality. Because the data sources are large, holding several hundred million to several billion records, and continuously growing, efficient techniques and algorithms are needed. One approach to speeding up the processing is a two-step process: potential candidate records are grouped together in step one, and each group is further processed and analyzed in step two. The record grouping problem is a formal formulation of what needs to be done in step one. This paper introduces a record grouping problem called the transitive closure problem and proposes algorithms to solve it. The proposed algorithms have been implemented efficiently in several ways, and the paper reports on an empirical study of these implementations.
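The grouping step can be illustrated with a small sketch. It is not the paper's algorithm, which the abstract does not describe. It assumes that step one receives a list of record pairs already judged to be potential matches and must place two records in the same group whenever a chain of such pairs connects them. That is one common reading of a transitive closure over a match relation. The function name `group_records` and the input shape are hypothetical, and the union-find structure is a standard technique chosen here for illustration.

```python
from collections import defaultdict


def group_records(record_ids, matched_pairs):
    """Group records by the transitive closure of a pairwise match relation.

    Illustrative sketch only: `record_ids` and `matched_pairs` are assumed
    inputs, not a format taken from the paper.
    """
    # Each record starts as the root of its own single-element group.
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each group under its root.
    groups = defaultdict(list)
    for r in record_ids:
        groups[find(r)].append(r)

    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Records 1, 2, 3 are linked through the chain 1-2 and 2-3.
    print(group_records([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)]))
```

Each resulting group would then be handed to step two for detailed comparison. For data at the scale the abstract mentions, an in-memory dictionary like this one would likely not be enough, which is presumably why dedicated algorithms are needed.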
Related articles
Achieving Simultaneous Spectrum Utilization and Revenue Improvements in Practical Wireless Spectrum Auctions
Spectrum is a valuable, scarce and finite natural resource that is needed for many different applications, so efficient use of the scarce radio spectrum is important for accommodating the rapid growth of wireless communications. Spectrum auctions are one of the best-known market-based solutions to improve the efficiency of spectrum use. However, spectrum auctions are fundamentally differen...
A Reliable Routing Algorithm for Delay Sensitive Data in Body Area Networks
Wireless body area networks (WBANs) include a number of sensor nodes placed inside or on the human body to improve patient health and quality of life. Ensuring the transfer and receipt of sensitive data is a very important issue. Routing algorithms should support a variety of quality-of-service requirements, such as reliability and delay in sending and receiving data. Loss of information or excessive da...
An Improved SSPCO Optimization Algorithm for Solve of the Clustering Problem
Swarm Intelligence (SI) is an innovative artificial intelligence technique for solving complex optimization problems. Data clustering is the process of grouping data into a number of clusters. The goal of data clustering is to make the data in the same cluster share a high degree of similarity while being very dissimilar to data from other clusters. Clustering algorithms have been applied to a ...
A Tool for Evaluating Strategies for Grouping of Biological Data
During the last decade an enormous amount of biological data has been generated and techniques and tools to analyze this data have been developed. Many of these tools use some form of grouping and are used in, for instance, data integration, data cleaning, prediction of protein functionality, and correlation of genes based on microarray data. A number of aspects influence the quality of the gro...